
Normalize write schema casing to table casing at catalog layer (Make OH reads/writes case-insensitive)#558

Merged
cbb330 merged 4 commits into linkedin:main from
pandaamit91:ampanda/oh-case-insensitive-schema-writes
May 2, 2026

Conversation

@pandaamit91
Contributor

Summary

Writers (DaliSpark, Spark SQL, Trino DML, Java Iceberg API) may submit column names with different casing than what the table stores (e.g. "id" vs "ID"). Because validateWriteSchema is case-sensitive, these writes were rejected with InvalidSchemaEvolutionException even when the field IDs matched, making case-insensitive writes impossible without changing the table's existing column names.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Fix: in doUpdateSchemaIfNeeded, normalize the write schema's top-level column names to match the table schema's casing (matched by Iceberg field ID, not by name) before any comparison or storage. The table's existing casing is never mutated, and writers do not need to know or match the exact casing stored in the catalog.

Tables where two or more top-level columns share the same case-folded name (e.g. "id" and "ID") are excluded from normalization because the target column would be ambiguous, so writes to such tables must still use exact casing.
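The normalization step and its guard can be sketched in isolation. This is a minimal model, not the actual OpenHouse code: `Field`, `hasCaseDuplicateFields`, and `normalizeToTableCasing` are illustrative stand-ins for Iceberg's `Types.NestedField` and the validator's methods, and only top-level columns are handled here.

```java
import java.util.*;

// Simplified model of the fix: field names are matched by Iceberg field ID,
// and the write schema's casing is rewritten to the table's casing.
public class CasingNormalizationSketch {
  record Field(int id, String name) {}

  // Guard: detect two columns that collide after case-folding (e.g. "id"/"ID").
  static boolean hasCaseDuplicateFields(List<Field> schema) {
    Set<String> seen = new HashSet<>();
    for (Field f : schema) {
      if (!seen.add(f.name().toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  // Rewrite each write-schema field to the table's casing, matched by ID.
  // Fields whose IDs are not in the table (new columns) pass through unchanged.
  static List<Field> normalizeToTableCasing(List<Field> tableSchema, List<Field> writeSchema) {
    if (hasCaseDuplicateFields(tableSchema)) {
      return writeSchema; // ambiguous target column; require exact casing
    }
    Map<Integer, Field> tableById = new HashMap<>();
    for (Field f : tableSchema) {
      tableById.put(f.id(), f);
    }
    List<Field> out = new ArrayList<>();
    for (Field f : writeSchema) {
      Field tableField = tableById.get(f.id());
      out.add(tableField == null ? f : new Field(f.id(), tableField.name()));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> table = List.of(new Field(1, "ID"), new Field(2, "name"));
    List<Field> write = List.of(new Field(1, "id"), new Field(3, "newCol"));
    // Existing column "id" is rewritten to "ID"; new column "newCol" passes through.
    System.out.println(normalizeToTableCasing(table, write));
  }
}
```

Note that the table schema is only read, never rewritten, which matches the PR's guarantee that the table's existing casing is never mutated.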

Testing Done

  • Manually tested on local Docker setup.

  • Added new tests for the changes made.

  • Updated existing tests to reflect the changes made.

  • No tests added or updated.

  • Some other form of testing, such as staging or soak time in production.

  • BaseIcebergSchemaValidatorTest: 10 unit tests covering hasCaseDuplicateFields and normalizeSchemaCasingToTable in isolation, including new-column passthrough and field attribute preservation.

  • RepositoryTest (H2 Spring integration): 2 new tests exercising the full save() path through the Spring context:

    • testCaseInsensitiveWrite_succeeds_andPreservesTableCasing: creates a table with "ID", saves an update with "id", asserts no exception and that the stored schema still shows "ID".
    • testCaseInsensitiveWrite_blockedForCaseDuplicateTable: creates a table with case-duplicate columns, asserts that a mismatched-casing save still throws InvalidSchemaEvolutionException.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.


@cbb330 (Collaborator) left a comment


Previously some tests were run which showed that writes with different casing already succeed.

So I want to validate what the existing behavior is before applying a fix. To do that, can you reproduce the issue in a separate PR under the Spark e2e tests? Similar to this:

https://github.com/cbb330/openhouse/blob/22baa29fdc6fc043d1b2f6d024d3b2341b1f3bec/integrations/spark/spark-3.1/openhouse-spark-itest/src/test/java/com/linkedin/openhouse/spark/e2e/dml/IncrementalReadTest.java

@pandaamit91
Contributor Author

@cbb330 here is the PR as requested, documenting existing casing behavior for all write paths - #562

pandaamit91 and others added 3 commits April 29, 2026 17:56
…isit

Replace the top-level-only loop implementations of hasCaseDuplicateFields
and normalizeSchemaCasingToTable with TypeUtil.SchemaVisitor-based
implementations that recurse through structs, list elements, and map
key/value types at any depth.

Key changes:
- hasCaseDuplicateFields: uses TypeUtil.visit to detect case-duplicate
  sibling fields at any nesting level; correctly treats same-named fields
  in different structs as non-duplicates; adds Locale.ROOT to toLowerCase
- normalizeSchemaCasingToTable: uses TypeUtil.indexById to build a single
  flat Map<Integer, NestedField> for the entire table schema in one O(n)
  walk, then TypeUtil.visit to rewrite field names at any depth; uses
  reference-equality short-circuit to avoid allocation for unchanged fields
- Both methods use Iceberg's visitor contract so new Type variants cause a
  compile error rather than a silent passthrough

Add tests covering nested struct, list element, and map value normalization
as well as the hasCaseDuplicateFields depth cases.
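The sibling-scoped duplicate rule in this commit (same-named fields in different structs are not duplicates) can be illustrated with a toy recursive model. The real implementation walks Iceberg types via `TypeUtil.SchemaVisitor`; the `Node` type and method name below are hypothetical stand-ins.

```java
import java.util.*;

// Toy model of depth-aware case-duplicate detection: only siblings within the
// same struct are compared after case-folding with Locale.ROOT; each nested
// struct's children form their own independent sibling scope.
public class NestedDuplicateSketch {
  record Node(String name, List<Node> children) {
    static Node leaf(String name) { return new Node(name, List.of()); }
  }

  static boolean hasCaseDuplicateSiblings(List<Node> siblings) {
    Set<String> seen = new HashSet<>();
    for (Node n : siblings) {
      if (!seen.add(n.name().toLowerCase(Locale.ROOT))) {
        return true; // two siblings collide after case-folding
      }
    }
    // Recurse into each struct: a fresh sibling scope per nesting level.
    for (Node n : siblings) {
      if (hasCaseDuplicateSiblings(n.children())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // "x" appears in two different structs: not a duplicate.
    var a = new Node("a", List.of(Node.leaf("x")));
    var b = new Node("b", List.of(Node.leaf("x")));
    System.out.println(hasCaseDuplicateSiblings(List.of(a, b)));
    // "id" and "ID" as siblings at the top level: a duplicate.
    System.out.println(hasCaseDuplicateSiblings(List.of(Node.leaf("id"), Node.leaf("ID"))));
  }
}
```

Using `Locale.ROOT` in `toLowerCase` is the locale fix the commit mentions: in a Turkish default locale, `"ID".toLowerCase()` would produce a dotless "ı" and miss the collision with `"id"`.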

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce SchemaValidationUtil in the open-source repo so both
open-source BaseIcebergSchemaValidator and li-openhouse share one
duplicate-detection implementation instead of maintaining divergent
subsets.

Changes:
- New SchemaValidationUtil with findDuplicateCaseInsensitiveColumnNames
  (returns conflict paths) and hasDuplicateCaseInsensitiveColumnNames
  (boolean), covering struct fields at any depth, list element types, and
  map key/value types; uses Locale.ROOT to fix the Turkish-i locale bug
- Remove hasCaseDuplicateFields from BaseIcebergSchemaValidator; the
  normalization guard in OpenHouseInternalRepositoryImpl now calls
  SchemaValidationUtil.hasDuplicateCaseInsensitiveColumnNames directly
- Move hasCaseDuplicateFields tests to SchemaValidationUtilTest; update
  the two integration-test precondition assertions in
  BaseIcebergSchemaValidatorTest to use the new utility

The two callers retain opposite semantics from the same predicate:
li-openhouse rejects writes when duplicates exist, while the normalization
guard merely skips normalization (writes with exact casing may still succeed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three gaps identified in the E2E and unit test suites:

Path B (RepositoryTest): writer submits wrong casing on an existing column
AND adds a new column in the same write. Normalization fixes the existing
column; validateWriteSchema then sees a valid evolution and must accept it.
This is the realistic migration scenario where a Spark/Trino client sends
a schema update alongside a column addition.

Path D (RepositoryTest): table has case-duplicate columns (guard skips
normalization). A write with exactly matching casing must still succeed
(sameSchema = true). Verifies the guard does not break legitimate writes
to legacy case-duplicate tables.

Unit (BaseIcebergSchemaValidatorTest): same as Path B but at the unit
level — normalize "id"→"ID", then confirm validateWriteSchema accepts
the resulting evolution (new column appended, existing IDs intact).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pandaamit91 pandaamit91 requested a review from cbb330 April 30, 2026 20:57
@cbb330 cbb330 merged commit 0ad4914 into linkedin:main May 2, 2026
1 check passed